Abstract

Background: The gene Polymorphic derived intron-containing, known as Pldi, is a long non-coding RNA (lncRNA) first discovered in mouse. Although parts of its sequence were reported to be conserved in rat and human, it is expressed only in mouse testis, from a mouse-specific transcription start site. The consensus sequence of Pldi is also part of an antisense transcript, AK158810, expressed in a wide range of mouse tissues.

Results: We focused on the sequence origin of Pldi and Ak158810. We demonstrated that their sequence originated from an intergenic region and is present only in mammals. Transposition events and chromosome rearrangements were involved in the evolution of the ancestral sequence. Moreover, we discovered that high conservation in part of this region was correlated with chromosome rearrangements, CpG demethylation and a transcription factor binding motif. These results demonstrated that multiple factors contributed to the sequence origin of Pldi.

Conclusions: We comprehensively analyzed the sequence origin of the Pldi-Ak158810 locus. We identified various factors, including rearrangement and transposable elements, that contributed to the formation of the sequence.



Introduction 

Although the human genome is pervasively transcribed, only 5%-10% of it is covered by mRNAs and spliced non-coding RNAs, and most of the transcribed sequence does not encode proteins [1]. Long non-coding RNAs (lncRNAs) are defined as transcribed non-coding RNAs longer than 200 nt, which play essential roles in regulating gene expression and chromatin function [2]. As lncRNAs act as biological building blocks, it is necessary to understand the process by which new lncRNA genes develop [3].

The emergence of a functional lncRNA gene can be summarized into several evolutionary scenarios, including metamorphosis of a protein-coding gene, derivation from a genomic region previously devoid of exonic sequence, duplication by retrotransposition, and emergence following tandem duplication or insertion of transposable elements [1,4]. For most of these scenarios, comprehensive studies have been carried out on specific lncRNA genes with well-known functions, such as XIST and HOTAIR [5,6]. However, little is known about how a new lncRNA gene develops from a non-transcribed genomic region. The mechanism of the de novo origin of a lncRNA gene remains to be clarified.

* Correspondence: gwding@sibs.ac.cn; yxli@sibs.ac.cn
† Contributed equally
Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Rd. Shanghai 200031, PR China
Full list of author information is available at the end of the article

© 2013 Dai et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Genomics

Previous studies on de novo proteins have shown that seemingly dispensable sequences in non-genic regions can generate adaptive functional proteins through evolution. The de novo birth and development of a potential protein-coding gene is accompanied by increasing open reading frame (ORF) length and conservation, as natural selection acts on random translation across the genome [3,7,8]. As with proteins, occasional transcription and sequence changes in non-genic regions could provide the raw material for de novo lncRNAs [9]. Here, we focused on the sequence origin of a lncRNA in an intergenic region, describing its sequence components and their changes across species.

The Pldi gene was previously identified and defined as a lncRNA gene of intergenic origin, which overlaps with a putative opposite-strand transcript, AK158810 (Additional File 1). Pldi is located within a 200 kb region that is free of annotated transcripts or expressed sequence tags (ESTs) in rat and human, which raises the possibility of de novo emergence of the Pldi-AK158810 locus (about 20 kb long). Knocking out Pldi reduces sperm motility and testis weight, indicating that Pldi is able to regulate the expression levels of other genes in testis [10]. Numerous functional non-coding RNAs have been demonstrated to regulate gene expression through an antisense mechanism, which highlights the importance of gene overlap in non-coding RNA function [11-13]. In contrast, few studies have discussed the origin of overlapping non-coding RNAs, because they lack clear markers such as the ORF in protein-coding genes.

In this study, we conducted a comprehensive analysis of the sequence origin of the mouse Pldi-Ak158810 locus. We evaluated various factors that contributed to its origin and provide evidence supporting the de novo origin of this locus. Moreover, we found that fixation at the Pldi-Ak158810 locus began in a specific overlapping region some time before the genes emerged. We further discuss the potential role of this local element in the evolution and fixation of this orphan lncRNA gene locus.

Materials and methods 

Genomes and sequences 

The 13 vertebrate genomes used in this study were downloaded from the UCSC genome database (http://hgdownload.soe.ucsc.edu/downloads.html). The genome versions are listed in Additional File 2. The sequence of the Pldi-Ak158810 locus was taken from the mouse (GRCm38) export data in Ensembl (http://www.ensembl.org).

ORF analysis 

The sequence of the EST Ak158810 was checked for all potential open reading frames using ORF Finder (Open Reading Frame Finder) with the default minimum frame size. ORF Finder is accessible at http://www.ncbi.nlm.nih.gov/gorf/gorf.html [14].
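
To illustrate what an ORF scan of this kind does, the following is a minimal Python sketch, not the ORF Finder implementation. It assumes the simplest definition of an ORF (an ATG followed in frame by the first stop codon) and scans only the three forward frames; the function names `find_orfs` and `longest_orf_aa` are hypothetical.

```python
# Minimal ORF scan (illustrative sketch; forward strand only, standard stop codons).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_aa=0):
    """Return (frame, start, end, aa_length) for each ATG...stop ORF.

    start/end are 0-based, end-exclusive and include the stop codon;
    aa_length counts codons from the ATG up to, but not including, the stop.
    """
    seq = seq.upper().replace("U", "T")  # accept RNA or DNA input
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                aa_len = (i - start) // 3
                if aa_len >= min_aa:
                    orfs.append((frame, start, i + 3, aa_len))
                start = None  # look for the next ORF in this frame
    return orfs


def longest_orf_aa(seq):
    """Length in amino acids of the longest ORF, or 0 if none is found."""
    return max((orf[3] for orf in find_orfs(seq)), default=0)
```

Applied to a transcript sequence, `longest_orf_aa` gives the kind of summary figure that is compared against a coding-length expectation when judging whether a transcript is likely to be non-coding.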

Sequence comparison and alignment, phylogeny analysis 

We used nucleotide BLAST (Basic Local Alignment Search Tool) to detect homology between the Pldi-Ak158810 nucleotide sequence and vertebrate genomes; the identity cutoff was set at 80%. Protein BLAST was used to find protein-coding genes homologous to the genes flanking Pldi and Ak158810. ClustalW (http://www.clustal.org/download/current/) was used to align protein and nucleotide sequences [15]. MEGA5.1 was used to construct neighbor-joining phylogenetic trees [16]. The MultiZ genomic alignment of 30 vertebrates was downloaded from UCSC [17,18]. All genomes were mapped to the mouse chromosomes.
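
As an illustration of the 80% identity cutoff, the sketch below filters BLAST tabular output (`-outfmt 6`, whose first twelve columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). It is a hedged example rather than the pipeline used in the study; the function name `filter_hits` and the reported fields are assumptions made for illustration.

```python
# Filter BLAST tabular (-outfmt 6) hits by percent identity (illustrative sketch).
def filter_hits(lines, min_identity=80.0):
    """Keep hits whose percent identity is at least min_identity.

    The subject strand is inferred from the subject coordinates:
    blastn reports sstart > send for minus-strand hits.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        pident = float(fields[2])
        if pident < min_identity:
            continue
        sstart, send = int(fields[8]), int(fields[9])
        hits.append({
            "query": fields[0],
            "subject": fields[1],
            "pident": pident,
            "length": int(fields[3]),
            "strand": "+" if sstart <= send else "-",
        })
    return hits
```

Keeping the inferred strand alongside each hit is convenient here because a hit on the opposite strand relative to its neighbours is one simple signal of a possible inversion.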



Repeats and transposable elements annotation 

Repeats and transposable elements were annotated with the RepeatMasker program. Sequences of Pldi-Ak158810 were submitted to the RepeatMasker website (http://www.repeatmasker.org, version 4.0.1) with default parameters. The repeat classes were grouped as SINE, LINE, DNA, LTR and others. In the analysis of ancient transposable elements, we did not include simple repeats and low-complexity sequences [19].
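
The grouping step can be sketched as follows. This is a minimal illustration that assumes repeat annotations are given as RepeatMasker-style class/family strings (for example `SINE/MIR` or `Simple_repeat`); the function names are hypothetical and the mapping reflects only the five groups named in the text.

```python
# Group RepeatMasker-style class/family labels into five broad groups (sketch).
MAIN_CLASSES = ("SINE", "LINE", "DNA", "LTR")
EXCLUDED_FROM_ANCIENT = {"Simple_repeat", "Low_complexity"}


def group_repeat_class(class_family):
    """Map a label such as 'SINE/MIR' to SINE, LINE, DNA, LTR or Others."""
    top_level = class_family.split("/")[0].rstrip("?")  # drop family and '?' flag
    return top_level if top_level in MAIN_CLASSES else "Others"


def is_candidate_ancient_te(class_family):
    """Simple repeats and low-complexity sequences are left out of the TE analysis."""
    return class_family.split("/")[0] not in EXCLUDED_FROM_ANCIENT
```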

Model for substitution rate change 

A simple model was constructed to test the change in substitution rate relative to the exons of surrounding genes. Three constant substitution rates were defined: $r_0$, the rate before inversion; $r_1$, the rate after inversion and before gene birth; and $r_2$, the rate after gene birth. For the reference sequences of the surrounding genes, the rate was summarized as a single rate $r_R$ across the whole phylogenetic tree. We assumed that the substitution rates relative to those exons would not change significantly if no selection pressure affected this region. Because only a time interval, rather than an exact time point, could be estimated for the inversion and for gene birth, we used two variables, $k_1$ and $k_2$, each ranging from 0 to 1, to reflect the timing of the two events. $k_1$ places the inversion within the time between the present and the common ancestor of human and rat, expressed as a proportion of that time. $k_2$ likewise places the emergence of lncRNA transcription within the time between the present and the common ancestor of mouse and rat. The distance between species X and Y, $d_{XY}$, can be approximately calculated as $d_{XY} = 2 \cdot r_{XY} \cdot t_{XY}$ if the rate is regarded as constant. Under the assumptions of our model (Figure 1), for the test sequence we get:



$$d^{Test}_{HD} = 2 \cdot r_0 \cdot t_{HD}, \tag{1}$$

$$d^{Test}_{HR} = (1 + k_1) \cdot r_0 \cdot t_{HR} + (1 - k_1) \cdot r_1 \cdot t_{HR}, \tag{2}$$

$$d^{Test}_{MR} = (1 + k_2) \cdot r_1 \cdot t_{MR} + (1 - k_2) \cdot r_2 \cdot t_{MR}, \tag{3}$$

where $d^{Test}_{HD}$, $d^{Test}_{HR}$ and $d^{Test}_{MR}$ are the genetic distances of the test sequences between human and dog, human and rat, and mouse and rat, respectively, and $t_{HD}$, $t_{HR}$ and $t_{MR}$ represent the divergence times between each pair of species.

For the reference sequence,

$$d^{Ref}_{HD} = 2 \cdot r_R \cdot t_{HD}, \tag{4}$$

$$d^{Ref}_{HR} = 2 \cdot r_R \cdot t_{HR}, \tag{5}$$

$$d^{Ref}_{MR} = 2 \cdot r_R \cdot t_{MR}, \tag{6}$$

where $d^{Ref}_{HD}$, $d^{Ref}_{HR}$ and $d^{Ref}_{MR}$ are the genetic distances of the reference exon sequences between human and dog, human and rat, and mouse and rat, respectively.

Figure 1 Assumptions of the model. We used the exons of surrounding genes as reference. We assumed that the substitution rate relative to those exons would not change much if no selection pressure affected this region. For the test sequence, we defined three constant substitution rates on the species tree to test the rate change at two time points: the inversion and the emergence of Pldi-Ak158810 transcription. (a) Red lines represent substitutions after gene birth, at rate $r_2$; green lines represent substitutions after the two observed inversions and before gene birth, at rate $r_1$; black lines represent substitutions before inversion, at rate $r_0$. Because we were concerned with the rate change relative to the exons of surrounding genes, all rates were normalized by the reference rate $r_R$. (b-d) $r_{averXY}$ was calculated by dividing the test-sequence XY distance by the reference XY distance, which indicates a relative substitution rate on the bold path; the three $r_{averXY}$ values will be similar if substitution rates do not vary significantly across the three stages.



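We calculated three average substitution rates,

$$r_{averHD} = d^{Test}_{HD} / d^{Ref}_{HD} = r_0 / r_R, \tag{7}$$

$$r_{averHR} = d^{Test}_{HR} / d^{Ref}_{HR} = [(1 + k_1) \cdot r_0 + (1 - k_1) \cdot r_1] / (2 \cdot r_R), \tag{8}$$

$$r_{averMR} = d^{Test}_{MR} / d^{Ref}_{MR} = [(1 + k_2) \cdot r_1 + (1 - k_2) \cdot r_2] / (2 \cdot r_R), \tag{9}$$

where $0 < k_1, k_2 < 1$. $r_{averHD}$, $r_{averHR}$ and $r_{averMR}$ are the average substitution rates at different stages. If $r_{averHD} < (>)\ r_{averHR}$, we get $r_0 < (>)\ r_1$, because $r_{averHR} - r_{averHD} = (1 - k_1)(r_1 - r_0) / (2 \cdot r_R)$. Similarly, a reduced $r_2$ will produce a lower $r_{averMR}$, as the mouse-rat path is the only evolving path affected by $r_2$.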
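
The behaviour of the three normalized rates can be checked numerically. The sketch below is an illustration under the model's stated assumptions; it takes each normalized rate as the test distance divided by the reference distance, with every pairwise distance spanning two branches of equal duration. The function name `expected_relative_rates` and the example parameter values are hypothetical.

```python
# Forward calculation of the normalized average rates in the three-rate model (sketch).
def expected_relative_rates(r0, r1, r2, r_ref, k1, k2):
    """Return (r_averHD, r_averHR, r_averMR) for given rates and timing proportions.

    r0: rate before inversion; r1: rate after inversion, before gene birth;
    r2: rate after gene birth; r_ref: constant reference (exon) rate.
    """
    if not (0 < k1 < 1 and 0 < k2 < 1):
        raise ValueError("k1 and k2 must lie strictly between 0 and 1")
    aver_hd = r0 / r_ref
    aver_hr = ((1 + k1) * r0 + (1 - k1) * r1) / (2 * r_ref)
    aver_mr = ((1 + k2) * r1 + (1 - k2) * r2) / (2 * r_ref)
    return aver_hd, aver_hr, aver_mr


# With equal rates in all three stages the normalized rates coincide;
# lowering r1 relative to r0 lowers r_averHR below r_averHD.
equal = expected_relative_rates(1.0, 1.0, 1.0, r_ref=0.5, k1=0.5, k2=0.5)
slowed = expected_relative_rates(1.0, 0.6, 0.6, r_ref=0.5, k1=0.5, k2=0.5)
```

Because $k_1$ and $k_2$ only weight the two rates on each path, the direction of the comparison between normalized rates does not depend on their exact values, which is why the model can be used without precise dates for the inversion or for gene birth.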

We used ClustalW to realign the conserved elements in Pldi-Ak158810 and the exons of the four surrounding genes, and manually removed sites with low similarity using BioEdit (http://www.mbio.ncsu.edu/bioedit/bioedit.html). The exons of all four genes were merged into one single alignment. A maximum likelihood (ML) tree and a distance matrix were then estimated with PAML 4.6 baseml for each alignment [20].



Methylation data

We collected methylation data from two sources for comparison: one from mouse tissue, the other from human ENCODE data.

Mouse brain methylation data were obtained from forebrain tissue of a laboratory mouse (GSM809309). The probability of methylation was estimated from both methylated and unmethylated fragment information (Additional File 3) [21].

Demethylation data from the human UCSF brain methylation database, viewed with the UCSC genome browser, were used to detect DNA methylation in the human region homologous to the Pldi-Ak158810 locus, which is displayed in Additional File 3 [22].

RNA-seq data 

RNA-seq data are from the ENCODE Cold Spring Harbor Lab (CSHL) RNA-seq track and include 5 tissue types (heart, kidney, ovary, spleen and testis). We viewed these data using the UCSC genome browser [23].

Transcriptional factor binding site data

The Human, Mouse, Rat (HMR) Conserved Transcriptional Factor Binding Site (TFBS) track was used to display the potential binding sites in the two highly conserved regions [24] (http://www.biobase-international.com/library/transfac).

Results

The co-emergence of Pldi and Ak158810 transcription

We studied the emergence time of Pldi and Ak158810. Pldi is located in an intergenic region free of any human and rat EST signals, indicating that Pldi and the putative antisense gene generating Ak158810 were not transcribed before the divergence of rodents. In the mouse lineage, an RNA transcript of Pldi has been discovered [10]. To validate the transcript Ak158810, we compared its 2.9 kb sequence with the mouse EST database from NCBI. The result confirmed transcription of the Pldi antisense strand, and EST hits matched the splicing of the first and second exons of Ak158810 (Additional File 4). We further analyzed the open reading frames (ORFs) in Ak158810 RNA. The longest ORF is shorter than 110 amino acids, and two AUG codons with shorter reading frames (~70 amino acids) precede this long ORF (Additional File 5). This indicates that Ak158810 is not likely to encode proteins. Our results, along with previous knowledge [10], show that Pldi and Ak158810 are two mouse-specific lncRNAs located antisense to each other. This evidence suggests that Pldi and its putative antisense lncRNA, Ak158810, were first transcribed at a similar time after the divergence of mouse and rat.

Pldi-Ak158810 loci is conserved in mammals and originated from an intergenic region

To study the evolution of the Pldi-Ak158810 locus, we searched for homologues of the Pldi and Ak158810 locus in 13 vertebrates. First, homologs of the Pldi-Ak158810 sequence were found in all mammals using Blastn. Except for a transposable element in rat and mouse, all the homologs lie in the region between unc5b and pcbd1 in mammals. This demonstrates that parts of the Pldi-Ak158810 locus were already present in the mammalian cenancestor. Meanwhile, we failed to detect significant sequence similarity to the Pldi-Ak158810 locus in non-mammalian vertebrates with Blastn (Figure 2 & Additional File 6).
Figure 2 Phylogenetic distribution of the Pldi-Ak158810 locus and its surrounding genes within vertebrate species. The phylogenetic tree of 13 vertebrates was adapted from a widely accepted tree topology [5,32]. The branch lengths do not represent the distance between species, and no molecular clock model was assumed. Different genes are highlighted with marks of different colors. Blastn hits were found in all mammals, whereas no hits were found in non-mammalian species. The 4 flanking protein-coding genes are arranged in order around the Pldi-Ak158810 locus. In contrast, in non-mammalian species, some large gaps (larger than the average distance of ~200 kb in mammals) are inserted into the region of the 4 flanking genes, which changes their order. The gap in Xenopus might be due to the incomplete genome description. The Blastn results can be found in Additional File 6.
In mouse, a ~450 kb genomic region surrounding Pldi and Ak158810 contains four protein-coding genes (slc29a3, unc5b, pcbd1 and sgpl1) that have orthologous genes in vertebrates (Figure 2 & Figure 3). In 12 vertebrates, the gene content, order and orientation of the four flanking genes are perfectly conserved. In zebrafish, the linkage of the four genes is broken (Figure 2). However, this does not correspond to the ancestral state of non-tetrapod vertebrates, because zebrafish is the only mismatch among the three fish species (fugu, tetraodon, zebrafish). These results indicate that the genomic region where Pldi and Ak158810 emerged has been stable since the cenancestor of vertebrates, and that the Pldi-Ak158810 locus originated from an intergenic region that was already linked in non-tetrapod vertebrates and has remained conserved in mammals.
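
A conserved gene content, order and orientation can be tested mechanically once each species' flanking genes are listed in chromosomal order. The sketch below is only an illustration of that comparison: the function name `same_synteny` is hypothetical, and any strand assignments supplied to it would need to come from the genome annotation of each species.

```python
# Compare gene order and orientation between two species (illustrative sketch).
def same_synteny(reference, other):
    """Return True if two gene blocks share content, order and orientation.

    Each block is a list of (gene, strand) tuples in chromosomal order.
    A block read from the opposite strand (reversed order with every
    strand flipped) is treated as the same arrangement.
    """
    if other == reference:
        return True
    flipped = [(gene, "-" if strand == "+" else "+")
               for gene, strand in reversed(other)]
    return flipped == reference
```

Treating the reverse-complemented block as equivalent matters because assemblies of different species may report the same chromosomal segment in opposite orientations.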

The origins of exons and introns 

Based on the exons and introns of Pldi and Ak158810, we further analyzed their origin in mouse, rat, human, dog, opossum and chicken (Figure 4 & Figure 5). Pldi consists of 3 exons and 2 introns, and Ak158810 consists of 5 exons and 4 introns. In non-rodent species, two fragments in intron 1 of Pldi were detected. We then used the MultiZ alignment to compare Pldi and its homologous regions in mouse, rat, human, dog, chicken and opossum. We discovered that the majority of the mouse Pldi-Ak158810 region, including exons and introns, could be aligned to rat, human and dog. In opossum, no fragment was mapped to the three Pldi exons. Ak158810 exon 1 and part of exon 3 are covered by opossum homologues, matching the conserved elements detected by our previous analysis (Figure 2). In chicken, few homologues were detected, except for part of Ak158810 exon 3.

The sequence alignment also revealed that chromosome rearrangements were involved in the evolution of the Pldi region. We identified two inversions at the locus. The first is the inversion of a ~800 bp fragment containing the first exon of Ak158810 (Figure 5 & Additional File 1). The other inversion overlaps Ak158810 exon 5. The regions in non-rodent mammals that are homologous to the two inverted fragments are in the opposite orientation to those of mouse, which reveals that both inversions occurred before the divergence of mouse and rat and after the divergence of primates and rodents. Interestingly, the first inverted region is highly conserved, which is discussed in a following section.
Figure 3 Clustering of paralogues and orthologues of the 4 flanking proteins. We used clustering to identify the paralogues and orthologues of the 4 flanking proteins (a: slc29a3, b: unc5b, c: pcbd1, d: sgpl1) in 4 species (mouse, rat, human and chicken). The result indicates that the regions surrounding the Pldi-Ak158810 locus are linked in these 4 species and that the 4 flanking genes have been free from duplication of their paralogues since the time of their cenancestor. Sgpl1 has no paralogue, and it has the same topological structure as the other genes.
Figure 4 Alignment of mouse and five other species (rat, human, dog, opossum, chicken): positions of transposable elements at the Pldi-Ak158810 locus. A link between two chromosome fragments represents one or a few adjacent blocks in the alignment; red links represent inversions. The various classes of transposable elements are plotted in five colors. SINE, short interspersed nuclear elements; LINE, long interspersed nuclear elements; DNA, DNA repeat elements; LTR, long terminal repeat elements, including retroposons; Others, other types of repeat sequences, including simple repeats and low-complexity sequences. Transposable elements that exist in at least two species including mouse are annotated with the letters A-K.



Transposable events contribute to the formation of Pldi-Ak158810 loci

Transposable elements (TEs) are considered an important component of the genome [25]. We therefore evaluated the contribution of transposons to the formation of the Pldi-Ak158810 locus. We compared the sequence of the mouse Pldi-Ak158810 locus and the homologous sequences in 5 other species (rat, human, dog, opossum and chicken) with the database of mobile elements, using the RepeatMasker program [19]. To understand whether the Pldi-Ak158810 locus was interrupted by ancient TEs, we manually checked and listed the eleven TEs that exist in at least two species including mouse (Table 1). No such TE was found in opossum and chicken, possibly because of the low homology between these two species and mouse.

In Pldi exons 1 to 3 and Ak158810 exons 1 to 4, no ancient TE was detected. However, almost half of all the defined ancient TEs are located in Ak158810 exon 5 (Table 2 & Figure 4). Inside the longest Pldi intron, intron 1, four ancient TEs were identified, three of which are also located in the overlapping Ak158810 intron region (Table 2 & Figure 4).
Figure 5 Formation of Pldi-Ak158810 exon sequences. Green pentagons represent homologous sequences that match specific mouse exons; red ones represent inverted exons. Pentagons with bold outlines indicate homologues with low similarity or a low percentage of coverage.



The data show no evidence that TE insertions were involved in most of the exons at the locus in the recent past, except for Ak158810 exon 5. The formation of the last exon of Ak158810 and of the intron sequences of both Pldi and Ak158810 is associated with various types of transposition events.
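
The criterion used for an ancient TE (present in mouse and in at least one other species) and the resulting class breakdown can be reproduced from the species lists in Table 1. The sketch below is illustrative; the dictionary simply transcribes Table 1, and the function names `is_ancient` and `class_counts` are hypothetical.

```python
# Apply the "at least two species including mouse" criterion to Table 1 (sketch).
from collections import Counter

# symbol: (TE name, repeat class, species carrying the TE), transcribed from Table 1
TABLE1 = {
    "A": ("MIR", "SINE", {"mouse", "rat", "human", "dog"}),
    "B": ("MIR", "SINE", {"mouse", "rat", "human", "dog"}),
    "C": ("Chap1_Mam", "DNA", {"mouse", "human"}),
    "D": ("MER91A", "DNA", {"mouse", "rat", "human", "dog"}),
    "E": ("URR1B", "DNA", {"mouse", "rat"}),
    "F": ("MTEb", "LTR", {"mouse", "rat"}),
    "G": ("MIR3", "SINE", {"mouse", "rat", "human"}),
    "H": ("MT2B2", "LTR", {"mouse", "rat"}),
    "I": ("B1F", "SINE", {"mouse", "rat"}),
    "J": ("PB1D10", "SINE", {"mouse", "rat"}),
    "K": ("L1MD3", "LINE", {"mouse", "human", "dog"}),
}


def is_ancient(species, focal="mouse", min_species=2):
    """True if the TE is present in the focal species and in min_species or more overall."""
    return focal in species and len(species) >= min_species


def class_counts(table):
    """Count TEs passing the criterion, grouped by repeat class."""
    return Counter(cls for _, cls, species in table.values() if is_ancient(species))
```

A TE found only in mouse, or only in species other than mouse, would fail `is_ancient` and would therefore not appear in a table built this way.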

An inverted element is highly conserved and acquired a reduced substitution rate after rearrangement

From the genomic sequences of mammals, we noticed that the Pldi-Ak158810 locus was disrupted by chromosome rearrangement some time before its emergence in the mouse lineage. Interestingly, one of the rearranged fragments, associated with Ak158810 exon 1, is highly conserved among species. We therefore estimated the substitution rate of this highly conserved region among species to test whether the inversion contributed to fixation of the local region. To better understand the evolution of this locus, we examined the change in substitution rate during fixation in 4 species: mouse, rat, human and dog. Taking the exon sequences of the flanking genes (pcbd1, slc29a3, sgpl1, unc5b) as a reference, we constructed a simple model to test the rate change at two time points: the point of chromosome



Table 1 Transposable elements (TEs) that contributed to the formation of ancestral Pldi-Ak158810 sequences.

TE symbol   TE name     Repeat class   Species with the TE
A           MIR         SINE           mouse, rat, human, dog
B           MIR         SINE           mouse, rat, human, dog
C           Chap1_Mam   DNA            mouse, human
D           MER91A      DNA            mouse, rat, human, dog
E           URR1B       DNA            mouse, rat
F           MTEb        LTR            mouse, rat
G           MIR3        SINE           mouse, rat, human
H           MT2B2       LTR            mouse, rat
I           B1F         SINE           mouse, rat
J           PB1D10      SINE           mouse, rat
K           L1MD3       LINE           mouse, human, dog

Only TEs in at least 2 species including mouse are listed. SINE, short interspersed nuclear elements; LINE, long interspersed nuclear elements; DNA, DNA repeat elements; LTR, long terminal repeat elements, including retroposons.



Table 2 Gene composition originated from TE

Gene       Element    Origin from TE
Pldi       exon 1     -
           exon 2     -
           exon 3     -
           intron 1   A, B, C, D
           intron 2   -
Ak158810   exon 1     -
           exon 2     -
           exon 3     -
           exon 4     -
           exon 5     G, H, I, J, K
           intron 1   B, C
           intron 2   D
           intron 3   -
           intron 4   E, F

A, B, C, D, E, F, G, H, I, J, K represent the different TEs listed in Table 1. A dash indicates that no ancient TE was detected.
Table 3 A simplified model to test the change of substitution rate at two time points: occurrence of the inversion and emergence of the Pldi and Ak158810 genes.

Reference                    Test sequence                          raverHD   raverHR   raverMR
Exons of surrounding genes   Conserved element 1 (inversed region)  2.5971    2.0603    1.406
Exons of surrounding genes   Conserved element 2                    1.6694    2.1832    1.5047

Normalized average substitution rates during specific lineages (e.g. "HD" stands for human and dog) are calculated.



rearrangement and the emergence of Pldi and Ak158810 in the mouse lineage (Figure 5). We extracted the sequences of two conserved elements (CE1, conserved element 1 in the inversion; CE2, conserved element 2 near Ak158810 exon 3), which could be detected by Blastn in distant organisms.

Compared with surrounding genes (Additional File 7 & Table 3), both CE1 and CE2 show their lowest normalized rates during mouse-rat divergence, in line with the result of a previous study that purifying selection has acted on the Pldi region since its emergence [10]. Furthermore, for CE1, the average rate for human-dog divergence is higher than that for human-rat, which implies that the substitution rate of this element was slightly reduced after rearrangement. For CE2, which is not involved in the rearranged regions, the tendency is the opposite. The data raise the possibility that specific elements of the Pldi-Ak158810 locus became fixed at an early time before gene birth. Inversion of CE1 may have contributed to its acquisition of purifying selection, causing a slightly reduced substitution rate.
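
The comparisons described above can be read directly off the values in Table 3. The following sketch encodes the table and applies the model's rule that a human-rat rate below the human-dog rate implies a lower rate after the inversion than before it. It reports only the direction of each difference and makes no statement about statistical significance; the function name `rate_trend` is hypothetical.

```python
# Read the direction of rate change from the normalized rates in Table 3 (sketch).
TABLE3 = {
    "CE1": {"HD": 2.5971, "HR": 2.0603, "MR": 1.406},
    "CE2": {"HD": 1.6694, "HR": 2.1832, "MR": 1.5047},
}


def rate_trend(aver_hd, aver_hr):
    """Direction of the post-inversion rate r1 relative to the pre-inversion rate r0."""
    if aver_hr < aver_hd:
        return "r1 < r0"
    if aver_hr > aver_hd:
        return "r1 > r0"
    return "r1 = r0"


for name, rates in TABLE3.items():
    lowest = min(rates, key=rates.get)  # lineage pair with the lowest normalized rate
    print(name, rate_trend(rates["HD"], rates["HR"]), "lowest:", lowest)
```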

Discussion 

Various factors contribute to the formation of Pldi-Ak158810 sequence

A new lncRNA gene could emerge through different scenarios, such as metamorphosis from a protein-coding gene, disruption by tandem repeats and transposable elements, and de novo origin from an intergenic region. Our analysis further confirmed the intergenic origin of the Pldi-Ak158810 sequence, without any sign of long genomic duplication in the recent past. Tracing back in history, both transposition events and chromosome rearrangements were found in the region. In conclusion, the formation of the Pldi-Ak158810 locus, which became a pair of lncRNA genes in the mouse lineage, was affected by multiple factors.

Fixation of partial Pldi-Ak158810 sequence before gene birth

A previous study indicates that the conservation of non-coding RNA is only slightly higher than that of intergenic regions [10]. In the Pldi region, reduced polymorphism has been detected in specific mouse lineages, which suggests the presence of purifying selection. Nevertheless, we found in our data that part of the Pldi-Ak158810 sequence is conserved in all mammals. This raises the possibility that purifying selection was acquired in part of the Pldi-Ak158810 region much earlier than gene birth.

We checked factors that could be responsible for the early fixation. Our calculation of substitution rate change shows that the inverted Ak158810 exon 1 tended to evolve more slowly after the inversion event, relative to surrounding genes (Table 3 & Additional File 7). This trend may reflect increased natural selection [25-27]. We also checked DNA modification of the region in human. A series of significant CpG demethylation signals is highly correlated with the conserved inverted element, CE1 (Figure 6 & Additional File 3), as viewed in the ENCODE browser [28]. CE1 overlaps the promoter region of Pldi's antisense gene, Ak158810, and the promoter sequence in mouse was found to have low DNA methylation [21]. Furthermore, from the conserved transcription factor binding site tracks in UCSC, we found that the site homologous to CE1 is a potential binding site for the transcription factor Chx10, conserved in both human and mouse (Figure 6) [24].

According to these observations, we suggest that, in species other than mouse, part of the Pldi-Ak158810 locus cannot simply be regarded as "non-functional" before the birth of Pldi.

Birth order of AK158810 and Pldi

It is known that two neighboring genes may form a transcriptional unit [11], which is correlated with expression. In this case, we assumed that the earlier-developed lncRNA might influence the birth of the other through its expression level. We attempted to determine the birth order of AK158810 and Pldi. Based on previous studies, the birth order of Ak158810 and Pldi is not clear, for the following reasons. First, testis, where Pldi was born, has been considered an important organ for the emergence of novel genes [4,29]. According to RNA-seq data (CSHL) and a previous study [10], Pldi is a testis-specific lncRNA, while Ak158810 is likely to have a wide expression range that includes heart, spleen and kidney (Additional File 8). This indicates that Ak158810 may not be as young a gene as Pldi [4,30]. Second, according to northern blot experiments, Pldi is present in the testis of more mouse species or lineages [10,31], suggesting that it is more likely to be older than Ak158810. These conflicting results, together with the observation that expression of both lncRNAs is limited to mouse, suggest that AK158810 and Pldi are newly transcribed lncRNAs of similar age that arose after the divergence of mouse and rat.

Figure 6 The demethylation signals and predicted TFBS found in human imply a tendency toward transcription in this area. A strong DNA demethylation signal (green) was found in the CE1 region. In the HMR conserved TFBS track, each black line represents one putative TFBS that is conserved in the human/mouse/rat alignment. The red rectangle shows the coincidence of CE1 with potential functional signals, such as demethylation, TFBS (black) and inversion (yellow arrow direction).

Conclusion 

In this study, we comprehensively analyzed the sequence origin of a lncRNA antisense gene pair, Pldi-Ak158810. We found that various factors, including rearrangement and transposable elements, contributed to the formation of the sequence. We also found that part of the sequence of the locus was already highly conserved in mammals before the birth of the genes, and we provided evidence and correlated factors for the early fixation of the conserved elements.

List of abbreviations

Pldi: Polymorphic derived intron-containing; lncRNA: long non-coding RNA; ESTs: expressed sequence tags; TEs: Transposable elements; CE1: Conserved Element 1; CE2: Conserved Element 2; SINE: Short interspersed nuclear elements; LINE: Long interspersed nuclear elements; DNA: DNA repeat elements; LTR: Long terminal repeat elements, which include retroposons; TFBS: Transcriptional factor binding site; CSHL: Cold Spring Harbor Lab; HMR: Human, mouse, rat.



Additional material 



Additional file 1: Pldi and its antisense transcript Ak158810. A screenshot from the UCSC Browser of the region containing Pldi and Ak158810. These two transcripts share a region about 8000 bp long. From EST data, there is a potential antisense region of overlap between the first exon of Pldi and the fourth exon of Ak158810.

Additional file 2: The subject genomes for Blastn were taken from UCSC. The genomes of 13 vertebrates were downloaded from UCSC, and the genome versions are listed by species.

Additional file 3: The methylation degree of CpGs in the Pldi-Ak158810 region. These data were obtained from the forebrain tissue of a laboratory mouse (GSM809309). The methylation score on the y-axis represents the probability that a CpG site is methylated. The x-axis represents the genome position. CpG sites are enriched in the first exon of Ak158810 in the CE1 region. Pldi contains few CpG sites near its transcription start region. The blue arrow shows the direction of transcription.

Additional file 4: The splicing evidence for transcription of Ak158810. We compared Ak158810, including its introns, with the mouse EST database in NCBI. Several tags could be mapped to Ak158810 exons (in blue circles), and the first splicing site between exon 1 and exon 2 could be observed.

Additional file 5: Potential ORF of AK158810. ORF Finder showed the potential ORFs and their positions in the transcript of Ak158810. The green frame represents the potential ORF. Only one potential open reading frame is longer than 100 amino acids, and two AUG codons with shorter reading frames (about 70 amino acids) precede this long ORF. Frame, position and length are shown alongside.

Additional file 6: Blastn result for the Pldi-Ak158810 region in 13 species. The sequence of the Pldi-Ak158810 region was used as a query to detect homologous sequences in 12 other vertebrates by Blastn. The command for Blastn is 'blastn -db $db -task blastn -db $db -task megablast -query $query -out $out -outfmt 6'. The length of the initial exact match is at least 28. 'db' is the reference genome of each of the 12 species, and 'query' is the sequence of the Pldi and Ak158810 locus in mouse.

Additional file 7: Sequence alignment file of 4 key species. The alignment files contain the regions we picked from the MultiZ file of 30 vertebrates to align dog, human, mouse and rat, in order to calculate the substitution rates in Table 3 based on the model presented in Figure 1. The alignment file includes the inverted element (Conserved element 1), tandem elements (Conserved element 2) and the surrounding genes.

Additional file 8: RNA-seq in different mouse tissues shows that the Pldi and Ak158810 locus is in a dynamic transcriptional state across tissues. Long RNA-seq data from ENCODE CSHL, viewed in the UCSC Browser, provided the expression level of the Pldi-Ak158810 region in different mouse tissues. Wide expression signals of Pldi and Ak158810 were found in testis. In heart, kidney and spleen, similar transcripts in the region of Ak158810 were enriched. Other tissues did not show specific expression of these two transcripts.